BMC Medical Informatics and Decision Making
Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match BMC Medical Informatics and Decision Making's content profile, based on 39 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.
Bressman, E.; Auerbach, A.; Keniston, A.; Jens, C.; Ranji, S.
Introduction: The use of artificial intelligence (AI) by clinicians has increased rapidly in recent years, with large language models (LLMs) emerging as tools that can equal clinician diagnostic performance in simulated settings. However, limited data exist regarding physicians' use of LLMs in real-world clinical practice. This study aimed to evaluate the frequency of LLM use among practicing hospitalists, identify which LLMs are most commonly utilized, and assess hospitalists' perceptions of the benefits and limitations of LLM use in clinical care. Methods: We conducted a cross-sectional survey study of academic hospital medicine faculty across 8 institutions within the Hospital Medicine Reengineering Network (HOMERuN), a collaborative research consortium. Eligible participants included hospitalists practicing within participating HOMERuN sites during the study period. The survey assessed the frequency of LLM use, types of LLMs used, clinical applications, and physician perceptions regarding usefulness, efficiency, and concerns associated with LLM adoption. Results: 170 respondents (67.1%) reported ever using an LLM in clinical practice. Among LLM users, OpenEvidence was the most used tool (88.9%), followed by ChatGPT (58.5%), Google Gemini (26.9%), and Microsoft Copilot (20.5%). Only a minority of hospitalists reported using LLMs daily while seeing patients. The most common use cases of LLMs were answering diagnostic (77.1%) and management (77.6%) questions. A majority also reported using LLMs to identify or summarize primary literature (60.0%). Lack of trust in outputs (49.8%), uncertainty around institutional policies (48.6%), and lack of access to secure applications (43.1%) were cited as the most frequent barriers to using LLMs in practice. Discussion: The use of LLMs in clinical practice is already widespread, though regular or daily use is not yet typical. 
Concerns regarding reliability, patient privacy, and safe integration into clinical workflows remain significant barriers to broader adoption. The responsible implementation of LLMs in hospital medicine will require addressing these barriers.
Sozol, S. S.; Dev Nath, B. C.; Fahim, F. M. S.; Suzana, N. N.; Mirza, J. F.; Ahmmed, S.; Zohra, F.-T.; Zafr, A. H. A.; Uddin, M. N.; Mondal, M. R. H.; Hoque, A. S. M. L.
Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges like inconsistent and limited datasets, limited infrastructure, and global inequalities lead to the need for a reliable and practicable ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying status. Several data preprocessing techniques, including multiple imputation by chained equations (MICE) and outlier removal, are considered. In addition, hyperparameter tuning is performed with the GridSearchCV tuning technique. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination value of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed that demonstrates clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.
Nakagawa, S.; Yamamoto, A.
Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches - including food ontologies, domain-specific fine-tuned language models, and manual expert mapping - require either substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A-H), varying combinations of product name, nutrient profile, and semantic category label. Results showed that nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity due to their LLM-derived origin. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). Model comparison across Claude Haiku, Sonnet, and Opus confirmed that NO_MATCH rates were consistent across model sizes (12-14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.
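The abstract names dominant category share without giving its formula. The sketch below shows one plausible reading, stated here as an assumption: for each source category, count how many of its matched items land in the single most frequent target category, then divide the sum of those counts by the total number of matched items (an item-weighted average). The function name and the toy category labels are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_category_share(assignments):
    """Item-weighted share of matches that fall in each source category's
    most frequent target category.

    ASSUMPTION: this definition is an illustrative reading of the metric,
    not the formula used in the preprint.
    assignments: iterable of (source_category, target_category) pairs.
    """
    by_source = defaultdict(Counter)
    for source, target in assignments:
        by_source[source][target] += 1
    total = sum(sum(counts.values()) for counts in by_source.values())
    if total == 0:
        raise ValueError("no assignments given")
    # For every source category, keep only the count of its top target.
    dominant = sum(counts.most_common(1)[0][1] for counts in by_source.values())
    return dominant / total


# Hypothetical toy data: 4 noodle items (3 agree on one target), 2 tea items.
pairs = [
    ("noodles", "Pasta"), ("noodles", "Pasta"), ("noodles", "Pasta"),
    ("noodles", "Soups"),
    ("tea", "Beverages"), ("tea", "Beverages"),
]
share = dominant_category_share(pairs)  # (3 + 2) / 6
```

A value near 1 means each source category maps almost entirely to one target category; lower values indicate assignments scattered across several targets.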
Naderalvojoud, B.; Sutjiadi, B. J.; Koul, A.; Curtin, C.; Gevaert, O.; Hernandez-Boussard, T.
Background Machine learning (ML) models are increasingly used to predict adverse outcomes after surgery. However, most rely on static patient characteristics (e.g., age, comorbidities) and overlook clinician-controlled treatment decisions that can be actively modified at the point of care. Discharge opioid prescribing is a key modifiable, clinician-controlled decision, yet optimizing prescribing choices across multiple adverse outcomes remains underexplored in predictive modeling. This study addresses that gap by introducing a novel ML framework that explicitly separates fixed patient risk factors from modifiable prescribing options to support personalized, risk-informed opioid prescribing decisions. Methods We developed the Hierarchical Clinical Fusion Transformer (HCF-Transformer), an ML model designed to estimate patient-specific risks across four postoperative outcomes: prolonged opioid use (POU), chronic pain (CP), 30-day readmission, and opioid-associated outcomes (OAO). The model constructs patient risk profiles from fixed, non-modifiable baseline factors, followed by a transformer layer. Clinician-controllable discharge opioid regimens are modeled as alternative intervention candidates and fused with the fixed risk representation through a clinical fusion mechanism, enabling assessment and ranking based on predicted risks. A Total Relative Risk (TRR) metric, calibrated to each outcome prediction threshold, guides the recommendation process. We evaluated the model in diabetic surgical patients, a common high-risk population. Results The study included 157,853 unique diabetic surgical patients, with outcome prevalences ranging from 47.2% (POU) to 1.8% (OAO). The HCF-Transformer achieved the highest AUROCs, 0.798 for POU, 0.712 for 30-day readmission, 0.808 for CP, and 0.922 for OAO, outperforming Random Forest, FT-Transformer, and ResNet-based models. 
Compared to these baselines, HCF-Transformer generated more stable and discriminative risk estimates and demonstrated significant variation in TRR scores across discharge opioid options (ANOVA p < .01, eta-squared > .01). This enabled consistent identification of lower-risk regimens tailored to patient-specific profiles. Conclusions The HCF-Transformer introduces a novel hierarchical fusion approach to optimize opioid prescribing by integrating static patient risk profiles with modifiable discharge options. Using transformer-based modeling and a quantifiable TRR metric, the model delivers personalized, risk-aware recommendations. This approach enables data-driven opioid prescribing tailored to individual risk and has the potential to improve postoperative outcomes in high-risk populations. Our findings demonstrate that integrating modifiable factors with structured risk profiles through a transformer-based fusion architecture can enhance decision-support systems, paving the way for more actionable and personalized AI in healthcare.
Baroud, S.
Migraine detection and sentiment analysis in healthcare have become increasingly important, particularly with the rise of social media platforms like Twitter, where users often share their personal health experiences. This study presents MASHA (Multi-Agent System for Healthcare Sentiment Analysis), an artificial intelligence (AI)-driven framework that integrates multiple machine learning (ML) models for sentiment analysis of Arabic tweets related to migraines. The system leverages a multi-agent architecture to handle tasks such as data acquisition, pre-processing, model training and real-time decision-making. Key ML models, including Support Vector Machines (SVM), Naive Bayes (NB) and Logistic Regression (LR), are integrated using ensemble techniques, leading to improved classification performance. Experiments conducted on a dataset of Arabic tweets demonstrate that MASHA outperforms traditional methods, achieving an accuracy of 90.0% and an F1-score of 89.46%. Moreover, the system's scalability and flexibility make it suitable for real-time public health monitoring, offering valuable insights into patient experiences and public sentiment regarding healthcare services. MASHA's adaptability suggests its potential application for analysing other healthcare-related conditions, reinforcing the system's scalability and broader relevance. Future work will focus on incorporating deep learning (DL) models and expanding the dataset with content from additional social media platforms.
Gatto, J.; Yang, J.; Seegmiller, P.; Rahat, R.; Burdick, T.; Preum, S. M.
Patient portal messaging has become a primary channel for asynchronous clinical communication, spanning a wide range of content, from symptom reports and medication concerns to administrative requests. Despite this volume and diversity, there is no formal representation for what a portal message contains: no vocabulary for the clinical and administrative events it describes, or for the attributes of those events that the patient has actually disclosed. Without such a representation, it is difficult to systematically analyze portal communication, assess message completeness, or build downstream tools that depend on structured input, such as automated triage, response drafting, and follow-up question generation. A clinical event schema, grounded in real portal messages and reviewed by clinicians, would provide this missing foundation. We introduce a clinical event ontology for patient portal messages, containing 8 event types and 70 roles that span clinical content (symptoms, medications, diagnostic tests, treatment responses, patient history) and administrative content (medical needs, logistics, social factors). The ontology was developed iteratively through collaboration with clinical experts and human evaluation. As a downstream application, we use the ontology to characterize the event types and roles most frequently sought in clinician follow-up questions, which provides insight into what clinicians ask about when reading portal messages.
Ye, L.; Lyu, B.; Yang, Q.; Mou, X.; Nawawonganun, R.; Laohasiriwong, W.
Background: Multi-drug-resistant bacterial (MDRB) infections in intensive care units (ICUs) substantially elevate patient mortality, prolong hospital stays, and impose heavy healthcare cost burdens. Existing predictive models for ICU-acquired MDRB infection predominantly focus on static admission-risk assessment, lacking the capacity to leverage longitudinal treatment data for dynamic risk re-stratification during the ICU stay. Meanwhile, most models suffer from poor clinical interpretability, overreliance on hard-to-collect biomarkers, or absence of deployable clinical tools, limiting real-world translation. Therefore, there is an urgent need to develop a parsimonious, interpretable tool based on routine cumulative data to guide timely intervention. This study aimed to develop an interpretable model with a web calculator to improve clinical applicability. Methods: In this study, we conducted a retrospective analysis of ICU inpatients at the First Affiliated Hospital of Dali University between January 1, 2023, and January 1, 2026. Using the createDataPartition function in R (random seed = 42), the dataset was stratified and divided into a training group and a validation group in a 7:3 ratio. Feature selection was performed using the Boruta algorithm to validate variable rationality. A multivariable logistic regression model was constructed and visualized as a nomogram, and its performance was compared with six machine learning algorithms (Random Forest, XGBoost, Neural Network, etc.). Model validation was conducted using receiver operating characteristic (ROC) curves, decision curve analysis (DCA), and SHAP value interpretation. Finally, an online R Shiny calculator was developed based on the final model. Results: A total of 3,631 patients were enrolled and divided into a training group (n=2,543) and a validation group (n=1,088) using stratified random sampling. 
Five independent predictors were identified in the training group: hypertension combined with diabetes, antibiotic types, ventilator days, urinary catheter days, and PCT abnormality times. The logistic regression model achieved an AUC of 0.772 (95% CI: 0.733-0.812) in the validation group, outperforming XGBoost (0.763) and Random Forest (0.703). The model demonstrated excellent calibration (Hosmer-Lemeshow χ² = 1.94, P = 0.9829) and positive net clinical benefit across threshold probabilities of 0%-40%. SHAP analysis aligned with regression-derived variable importance rankings, confirming predictor contributions. An open-access online calculator was successfully deployed (https://dongfangshao666.shinyapps.io/MDR_shiny2/), enabling real-time individualized risk stratification at the bedside. Conclusion: This study developed and validated a dynamic, interpretable multi-drug-resistant bacterial infection risk prediction model requiring only five routinely collected clinical indicators. The model balances robust predictive performance with high transparency, overcoming key limitations of prior tools. The accompanying web calculator supports dynamic risk reassessment throughout the ICU stay, facilitating precise antimicrobial stewardship, targeted infection control interventions, and optimized resource allocation, bridging the gap between statistical modeling and frontline clinical decision-making.
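For readers unfamiliar with the net clinical benefit reported in decision curve analysis, the sketch below applies the standard net benefit formula, NB = TP/n - (FP/n) * pt/(1 - pt), at a single threshold probability pt. The labels and predicted risks are made-up toy values, not study data, and the function name is hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    Standard decision-curve formula: TP/n - FP/n * pt / (1 - pt).
    y_true: 0/1 outcomes; y_prob: predicted risks in [0, 1].
    """
    if not 0 <= threshold < 1:
        raise ValueError("threshold must be in [0, 1)")
    n = len(y_true)
    if n == 0 or n != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


# Toy example (hypothetical values): five patients, threshold of 25%.
outcomes = [1, 0, 1, 0, 0]
risks = [0.9, 0.6, 0.3, 0.2, 0.1]
nb = net_benefit(outcomes, risks, 0.25)  # 2/5 - (1/5) * (0.25/0.75)
```

A decision curve repeats this calculation over a grid of thresholds and compares the model against the "treat all" and "treat none" strategies.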
Enikeev, R.; Moldovan, M.; Chu, M.; Amalraj, A.; Koli, P. P.; Abdul, S. S.; Sivaraj, H.; Iqbal, U.; Toh, C. K.
Background: Structuring oncology clinical notes into registry-grade variables is essential for research and care but remains labour-intensive and error-prone. Objective: To develop and evaluate a privacy-preserving large language model pipeline for oncology registry abstraction in a real-world clinical setting. Methods: We deployed an open-source Meta Llama 3.3 70B-based pipeline to extract over 50 variables from 6,700 oncology notes at a cancer centre in Singapore. Data were de-identified locally using a Hide-In-Plain-Sight approach, ensuring no identifiable data left hospital infrastructure. Performance was assessed on 200 randomly sampled notes with adjudicated ground truth. A structure-aware framework classified outputs as correct, missing, spurious, or incorrect. Results: F1 scores were high across variables, including diagnosis (97.2%), histology (95.8%), stage (92.6%), biomarkers (91.4%), and treatments (88.1%). Transferability testing on 50 external notes showed strong performance for core variables. Conclusions: Privacy-preserving LLMs can achieve near-human-level accuracy for oncology abstraction, with structure-aware evaluation enabling more clinically meaningful assessment. Keywords: Oncology Registry Abstraction, Privacy-Preserving Deployment, Clinical Information Extraction, Structure-Aware Evaluation, Large Language Models, Template-Filling Metrics
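The four output classes (correct, missing, spurious, incorrect) map naturally onto template-filling precision, recall, and F1. The sketch below uses the conventional MUC-style definitions; treating that convention as the one behind the reported scores is an assumption, and the counts are invented for illustration.

```python
def template_filling_scores(correct, missing, spurious, incorrect):
    """Precision, recall, and F1 from slot-level outcome counts.

    ASSUMPTION: MUC-style scoring, where an incorrect value counts against
    both precision and recall; the preprint may define these differently.
    """
    predicted = correct + spurious + incorrect  # slots the system filled
    expected = correct + missing + incorrect    # slots present in ground truth
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one variable across a set of notes.
precision, recall, f1 = template_filling_scores(
    correct=90, missing=5, spurious=3, incorrect=7
)
```

Separating missing from spurious errors shows whether a variable is hurt more by under-extraction or by values that have no support in the note.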
Tharzeen, A.; Vafaei Sadr, A.; Radfar, N.; Hwang, W.; Abedi, V.; Zand, R.
Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction. 
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.
Osborne, T.; Mahmud, T.; Zheng, X.; Jampala, S.; Abbasi, S.; Hong, S.; Kranz, K.; Lee, S.; Ng, P.; Odekon, K.; Schachter, L.; Sexton, R.; Spinnato, T.; Tharakan, M.; Wu, Z.; Wang, F.; Wong, R.
Although large language models (LLMs) have shown promise for discharge summary generation, their value may be greater in longer hospitalizations, where increasing documentation volume and complexity increase both clinician burden and the risk of communication failures during transitions of care. Prior evaluations of LLM-generated discharge summaries have largely involved shorter stays and have rarely examined receiving-clinician priorities or incidental finding reporting. We compared LLM-generated and human-authored discharge summaries for 60 Internal Medicine hospitalizations lasting 7 to 21 days, with paired assessment by hospitalists and primary care physicians (PCPs). Clinician reviewers preferred LLM-generated summaries for 95% of encounters and rated them higher for quality, readability, factuality and completeness. PCPs, the primary recipients responsible for post-discharge care, found that LLM-generated summaries were better for understanding and communicating hospital care to patients, and providing follow-up care. LLM-generated summaries had fewer annotated errors, primarily due to fewer omissions, without increased estimated harm potential or likelihood compared with human-authored summaries. Benefits of LLM-generated summaries were especially salient for PCPs, who identified more omissions with greater downstream likelihood of harm than hospitalists. This underscores the importance of designing transition documents around the needs of clinicians assuming care post-discharge. LLM identification of radiology incidental findings was generally accurate and appropriate, suggesting potential to improve follow-up of clinically relevant findings. These findings extend prior work by demonstrating clinical value of LLMs in summarizing longer, complex hospitalizations and highlighting the value of stakeholder-centered design in clinical AI systems. 
Together, they support supervised LLM-assisted discharge summarization as a tool to reduce cognitive burden, improve documentation quality, and enhance transition-of-care communication.
Islam, N.; Luo, C.; Tong, J.; Weller, G.; Polleya, D. A.; Kent, A.; Bair, S.
Introduction In analyses of time-to-event data, clinical characteristics can have non-linear impacts on survival outcomes, and understanding this dynamic behavior is crucial for producing real-world evidence (RWE). Nonetheless, estimating these dynamic effects is inherently challenging when utilizing real-world data (RWD), especially since sharing individual-level patient data (IPD) is heavily restricted due to regulatory limitations. Additionally, computational difficulties are exacerbated by the high dimensionality, inter-dependency, rarity, sparsity, and scarcity of features. While data augmentation through collaboration across multiple sites might address these challenges, such collaboration is often infeasible and hindered by regulatory measures that protect patient privacy, thereby preventing the sharing of IPD between sites. Objectives To address this challenge, we propose a privacy-preserving regularized algorithm that eliminates the necessity of aggregating any protected health information across sites. This algorithm employs a penalized federated additive model utilizing piecewise exponential survival (FAMES) data and estimates non-linear effects of features while accounting for non-varying confounding effects. The model is flexible and can accommodate both multiple and multivariate smooth effects simultaneously. Methods The proposed model transforms survival data into a piecewise exponential data (PED) structure and casts the semi-parametric optimization problem into a generalized additive modeling framework assuming Poisson distribution. The model uses orthonormal splines to approximate non-linear effects and incorporates L2-norm based penalty terms to control the smoothness and goodness-of-fit of these effects. The algorithm is optimized using site-specific aggregated summary statistics and is solved iteratively through the Newton-Raphson method. 
Results The model is employed to assess the smooth effects of clinical features, such as age and numeric laboratory values, on overall survival using RWD from approximately 874 newly diagnosed Acute Myeloid Leukemia (AML) patients treated at seven distinct sites in the United States. The model exhibited non-linear smooth effects for lactate dehydrogenase, platelets, and others underscoring their strong association with disease prognosis. The model demonstrates a lossless property, providing estimates of smooth and fixed effects that are comparable to those derived from the pooled PED. Additionally, the inference of parameters for testing the nullity of effects remains consistent. This model is communication-efficient, necessitating roughly twelve rounds of communication across sites. Conclusion We anticipate that this model can facilitate multisite collaboration and enable smaller sites to participate in generating and validating RWE, especially for rare diseases. While the model was applied within the context of AML, it is disease-agnostic and can be implemented in any other clinical context and across various sites globally without losing any generality.
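The piecewise exponential data (PED) transformation mentioned in the Methods can be illustrated for a single subject. The sketch below splits follow-up time at fixed cut points and emits one row per interval, with an event indicator and a log-exposure offset, which is the form a Poisson model expects. It is a generic illustration of the PED idea, not the FAMES implementation; the function name, the row layout, and the cut points are assumptions.

```python
import math


def to_ped_rows(time, event, cut_points):
    """Expand one subject's (time, event) into piecewise-exponential rows.

    cut_points: sorted interval boundaries starting at 0, e.g. [0, 6, 12, 24].
    Follow-up beyond the last cut point is ignored, so the cut points should
    cover the longest follow-up (illustrative simplification).
    """
    rows = []
    for start, end in zip(cut_points[:-1], cut_points[1:]):
        if time <= start:
            break  # subject left the risk set before this interval
        exposure = min(time, end) - start  # time at risk inside (start, end]
        rows.append({
            "start": start,
            "end": end,
            "exposure": exposure,
            "offset": math.log(exposure),  # log-exposure offset for Poisson fit
            "status": int(bool(event) and time <= end),  # event in this interval?
        })
    return rows


# Hypothetical subject: event at month 8, intervals (0,6], (6,12], (12,24].
rows = to_ped_rows(time=8, event=1, cut_points=[0, 6, 12, 24])
```

Fitting a Poisson regression to the stacked rows with `status` as the response and `offset` as the offset yields a piecewise-constant hazard, which is what lets the survival problem be cast in a generalized additive modeling framework.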
Shah, K. P.; Airan Javia, S.; Savage, T.; Bressman, E.
End-of-rotation handoffs are critical for patient safety but add to documentation burden for hospitalists. Generative artificial intelligence (AI) may help automate handoff creation using electronic health record data, but its impact on quality and safety is unclear. Methods: We developed an AI handoff tool with a large language model using clinical notes as input and conducted a retrospective evaluation comparing AI-generated and clinician-authored handoffs. Handoffs were assessed across domains of quality and safety through a structured review. Results: Quality ratings were similar between AI and human handoffs (3.7 vs. 3.5, p=0.57). AI-generated handoffs were rated higher for organization (4.4 vs. 4.1, p=0.05) and completeness (4.1 vs. 3.6, p=0.01), but lower for conciseness (3.7 vs. 4.1, p=0.03) and accuracy (4.1 vs. 4.4, p=0.03). Error rates were comparable (0.3/handoff in both groups); however, AI-generated handoffs included inaccuracies (9% of AI errors) and hallucinations (1% of AI errors), while clinician-authored handoffs contained only omissions. Conclusion: Human and AI handoffs have differing error profiles and tradeoffs between completeness and conciseness. Prospective evaluation in clinical workflows is underway.
Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.
Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
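The three measures used to compare original and perturbed texts are simple to state in code. The sketch below computes cosine similarity, L1 distance, and Euclidean distance for two vectors using only the standard library; the tiny vectors are placeholders for illustration, not real model embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Straight-line (L2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Placeholder vectors standing in for an original and a perturbed text.
original = [1.0, 0.0]
perturbed = [0.0, 1.0]
similarity = cosine_similarity(original, perturbed)
manhattan = l1_distance(original, perturbed)
straight = euclidean_distance(original, perturbed)
```

A comparison on partial vectors can be mimicked by slicing both inputs to the same subset of coordinates before calling these functions, for example `cosine_similarity(u[:k], v[:k])` for a hypothetical `k`.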
Izzo, J. A.; McIntyre, A. M.; Nguyen, J.; Bashaw, D.; Torrance, C. A.; Foster, J.
Objective: Stigmatizing language in the electronic health record (EHR) has been associated with adverse patient experience in substance use disorder care, including opioid use disorder (OUD). This study evaluated a privacy-preserving, locally deployed large language model as a method to detect stigmatizing language in the documentation of OUD patients with patient-directed discharge (PDD). Methods: In a retrospective cohort study, 477 inpatient admissions from the MIMIC-IV database with a diagnosis of opioid use disorder were classified using a locally deployed Gemma-4-31b-it-bf16 model and a predefined 140-term lexicon to identify stigmatizing language in clinical documentation. Results: Analysis of clinical documentation showed stigmatizing language was present in 84.1% (190/226) of the PDD cohort vs 62.2% (156/251) of the non-PDD cohort, with an unadjusted odds ratio of 3.21 (95% CI 2.07-4.98; p < 0.0001). After adjustment for age, sex, insurance status, marital status, and race, PDD remained an independent predictor of stigmatizing documentation (aOR 2.24, 95% CI 1.40-3.59; p < 0.0001). Further analysis of stigma intensity showed higher stigmatizing markers in the PDD cohort vs the non-PDD cohort (2.85 ± 2.39 vs 2.02 ± 2.44; p < 0.0001). Discussion and Conclusion: Stigmatizing language is detected with increased frequency and prevalence in clinical documentation of OUD patients who initiate PDD compared to those who adhere to standard discharge processes. A locally deployed large language model (LLM) offers a scalable, privacy-preserving method to audit clinical documentation for stigmatizing language.
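The unadjusted odds ratio can be checked directly from the counts in the Results. The sketch below builds the 2x2 table from 190/226 and 156/251 and computes the odds ratio with a Wald 95% confidence interval on the log scale. That the authors used a Wald interval is an assumption, although this calculation agrees with the reported 3.21 (2.07-4.98) to two decimals.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.959964):
    """Odds ratio and Wald CI for a 2x2 table.

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    z: standard normal quantile (default gives a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from the abstract: stigmatizing language present / absent.
pdd_yes, pdd_no = 190, 226 - 190          # PDD cohort
non_pdd_yes, non_pdd_no = 156, 251 - 156  # non-PDD cohort

or_value, ci_low, ci_high = odds_ratio_wald(pdd_yes, pdd_no, non_pdd_yes, non_pdd_no)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
# prints: OR = 3.21 (95% CI 2.07-4.98)
```

The adjusted odds ratio (aOR 2.24) cannot be reproduced this way, since it depends on patient-level covariates in a multivariable model.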
Alsammani, A.; Johnson, M.; Elrefaei, J.
Objective: To develop, calibrate, and interpret machine learning models for predicting in-hospital mortality among intensive care unit (ICU) patients using clinical data collected during the first 24 hours of admission. Methods: We analyzed 53,866 adult ICU admissions from the MIMIC-IV (v2.2) database, including 5,787 in-hospital deaths (10.7%). An enhanced feature-engineering pipeline generated 88 laboratory-based features that captured distributional characteristics, temporal trends, and measurement frequency. Five machine learning classifiers were evaluated: L2-regularized logistic regression, random forest, XGBoost, LightGBM, and a calibrated soft-voting ensemble. Models were developed using a stratified 64:8:8:20 split for training, validation and hyperparameter tuning, calibration, and testing. Performance was assessed on a held-out test set (n = 10,774) using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), Brier score, calibration analysis, decision curve analysis (DCA), and SHAP-based model interpretation. Results: The calibrated ensemble achieved the best overall performance, with an AUROC of 0.856 (95% CI: 0.846-0.867), an AUPRC of 0.449 (95% CI: 0.418-0.480), and a Brier score of 0.078. XGBoost (AUROC 0.856; AUPRC 0.435) and LightGBM (AUROC 0.854; AUPRC 0.436) demonstrated performance comparable to the ensemble and significantly outperformed logistic regression (AUROC 0.823; AUPRC 0.376), yielding absolute AUROC improvements of approximately 0.031-0.033 (p < 0.001). Calibration substantially improved probabilistic predictions, reducing Brier scores by 42% for XGBoost (0.134 to 0.078) and 50% for LightGBM (0.151 to 0.076). Decision curve analysis demonstrated consistent net clinical benefit across the 5%-20% risk-threshold range. Key predictors included age, blood urea nitrogen, ICU subtype, measurement frequency, and lactate-related features. 
Model performance remained robust across ICU subtypes, with AUROC values exceeding 0.79. Conclusion: A calibrated and interpretable machine learning framework based on early ICU clinical data provides accurate and clinically actionable mortality risk estimates. By integrating trajectory-aware feature engineering, probabilistic calibration, and decision-analytic evaluation, this approach advances ICU mortality prediction toward more reliable and trustworthy clinical decision support systems.
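The Brier score and the net benefit used in decision curve analysis can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' pipeline: the function names `brier_score`, `net_benefit`, and `relative_reduction` are hypothetical, and the net-benefit formula is the standard one (true positives per patient minus false positives per patient weighted by the odds of the risk threshold).

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    return float(np.mean((p - y) ** 2))


def net_benefit(y_true, p_pred, threshold):
    """Net benefit at one risk threshold: TP/n - FP/n * (pt / (1 - pt))."""
    y = np.asarray(y_true, dtype=int)
    flagged = np.asarray(p_pred, dtype=float) >= threshold
    n = y.size
    true_pos = np.sum(flagged & (y == 1))
    false_pos = np.sum(flagged & (y == 0))
    return float(true_pos / n - false_pos / n * threshold / (1.0 - threshold))


def relative_reduction(before, after):
    """Fractional reduction of a score, e.g. a Brier score before and after calibration."""
    return (before - after) / before
```

Applied to the Brier scores reported in the abstract, `relative_reduction(0.134, 0.078)` gives about 0.42 and `relative_reduction(0.151, 0.076)` gives about 0.50, which agrees with the stated 42% and 50% reductions.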
Koumantakis, E.; Remoundou, K.; Fava, C.; Roussaki, I.; Visconti, A.; Berchialla, P.
Intensive Care Unit (ICU) readmissions are associated with adverse clinical outcomes and increased healthcare costs. Although existing models for predicting 30-day ICU readmission show high predictive performance, they fail to account for model uncertainty, potentially resulting in overconfident and unreliable decision-making. We propose a novel Ensemble Bayesian Model Averaging (EBMA)-based framework which balances predictive discrimination with uncertainty by penalizing models that are confident but incorrect. It achieved excellent calibration (Brier score = 0.051), while maintaining discriminatory performance comparable to or exceeding that of the best individual models (AUROC > 0.716). These findings suggest that our EBMA-based framework provides a more robust and clinically reliable approach for ICU readmission prediction and decision support.
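The idea of down-weighting models that are confident but incorrect can be sketched with a simplified likelihood-weighted average. This is not the authors' EBMA implementation, whose details are not given in the abstract; it is an illustrative stand-in that assumes equal prior weight per model and derives weights from the held-out Bernoulli log-likelihood. All function names are hypothetical.

```python
import numpy as np


def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood; confident wrong predictions are penalized heavily."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def averaging_weights(y_true, model_probs):
    """Weights proportional to exp(log-likelihood), assuming equal model priors."""
    ll = np.array([log_likelihood(y_true, p) for p in model_probs])
    w = np.exp(ll - ll.max())  # subtract the maximum for numerical stability
    return w / w.sum()


def averaged_prediction(model_probs, weights):
    """Weighted average of per-model predicted probabilities."""
    return np.average(np.asarray(model_probs, dtype=float), axis=0, weights=weights)


# Toy validation data: one cautious model, one overconfident model with two bad misses.
y_val = [1, 0, 1, 0]
cautious = [0.6, 0.4, 0.6, 0.4]
overconfident = [0.99, 0.01, 0.01, 0.99]
weights = averaging_weights(y_val, [cautious, overconfident])
blended = averaged_prediction([cautious, overconfident], weights)
```

In this toy case nearly all of the weight goes to the cautious model, because the log-likelihood punishes the two confident misses of the other model far more than it rewards its two confident hits.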
Plasek, J. M.; Li, Y.; Amato, M. G.; Foer, D.; Seger, D. L.; Alzaidi, S.; Zhou, H.; Jackson, G. P.; Bates, D. W.; Zhou, L.
Background: Adverse drug events (ADEs) are a critical indicator of patient safety but are often documented only in free-text clinical notes. The potential of recent advances in natural language processing (NLP), particularly generative large language models (LLMs), to identify ADEs remains understudied. This study aimed to compare the performance of multiple LLMs in identifying ADE-drug relationships in inpatient and ambulatory clinical notes. Methods: We used clinical notes from the 2018 National NLP Clinical Challenge (n2c2) ADE dataset (inpatient; n=505) and from outpatient encounters (n=2,555) between October 1, 2018, and December 31, 2019, at a large academic medical center based in New England. Notes were pre-processed into snippets for model input. The evaluated models were GPT-4o, GPT-4o-mini, LLAMA 3.3-70B, and their instruction fine-tuned variants (including low-rank adapters for LLAMA). Performance was assessed using both strict and relaxed evaluations (precision, recall, and F1) for all models, followed by manual evaluation (exact semantic match, partial match, missing ADE, drug mention only, not a drug, or wrong) of the two best-performing models. Results: GPT-4o and GPT-4o-mini were the top-performing models among those evaluated. GPT-4o consistently outperformed GPT-4o-mini in ADE extraction across both datasets, with higher F1-scores (0.524 vs. 0.381) and a more balanced precision-recall profile. Both models captured ADEs effectively in explicit and complex clinical contexts, although limitations included misclassification of pre-existing allergies and occasional conflation of therapeutic indications with adverse effects. GPT-4o achieved higher exact-match coverage and fewer errors across clinical notes, indicating more reliable performance in both inpatient and ambulatory settings. 
Conclusion: This work establishes a foundation for integrating LLM methods into real-world drug safety surveillance, with direct implications for improving patient safety.
Zhang, Y.; Trinh, S. H.; Phelan, T.; Byrd, T. F.; Tourani, R.; Kumar, V.; Caraballo, P. J.; Melton, G. B.; Simon, G. J.
Background: Sepsis is a life-threatening condition in which delayed recognition and treatment are associated with increased mortality. While predictive models such as Epic's Early Detection of Sepsis Model (ESM) were developed to support early intervention, their real-world impact after integration into clinical workflows remains difficult to evaluate. Objectives: To evaluate the real-world impact of ESM integrated into clinical workflow on clinical outcomes, antibiotic use, and harm-benefit tradeoffs. Methods: We conducted a quasi-experimental study in a single healthcare system using encounter-level data from inpatient settings. Inpatient mortality, prolonged hospitalization, antibiotic use, and sepsis prevalence were compared between the pre-implementation period (3 June 2023 to 20 August 2024) and the online period (21 August 2024 to 26 December 2024) when the model became visible to clinicians. We also applied a counterfactual framework using models trained on pre-implementation data to estimate expected outcomes without ESM and to quantify harms related to overtreatment and delayed treatment. Results: Among 101,138 encounters, 86,884 occurred during the pre-implementation period and 14,254 during the online period. In unadjusted analyses, the online period had lower inpatient mortality, prolonged hospitalization, antibiotic use, and sepsis prevalence (all p ≤ 0.002). In the counterfactual analyses, observed outcomes were lower than expected without ESM for mortality (1.21% vs 1.82%; p<0.001), prolonged hospitalization (5.56% vs 7.95%; p<0.001), and antibiotic use (43.52% vs 47.04%; p<0.001). False positive harm (37.72% vs 41.68%; p<0.001) was also lower than expected. 
Conclusions: Integration of ESM into clinical workflow was associated with improved patient outcomes, reduced antibiotic use, and decreased harm from overtreatment, without evidence of increased harm from delayed treatment, supporting a positive net clinical benefit and the safety and effectiveness of ESM under Software as a Medical Device principles. Keywords: Machine learning, Electronic health records, Clinical workflow, Counterfactual analysis, Real-world evaluation
Rey-Blanes, A.; Veredas-Morente, J.; Vivas-Vargas, E.; Gil-Garcia, F.; Moreno-Barea, F. J.; Veredas, F. J.
Background and Objective: Access to real-world electronic health records (EHRs) remains limited by privacy, governance and annotation constraints, hindering the development of clinical natural language processing models. Realistic synthetic progress notes may provide EHR-like corpora that preserve clinically rigorous information on diagnoses, treatments, symptoms, imaging, laboratory findings and therapeutic trajectories without relying directly on sensitive patient records. This study evaluates whether large language models (LLMs) can generate realistic Spanish prostate cancer progress notes from published case reports, preserving clinical content, temporality and hospital-style conventions.
Alickovic, F.; Lenz, S.; Ustjanzew, A.; Ortiz Rosario, L.; Vollmar, G. M.; Kindler, T.; Panholzer, T.
Introduction: Coding tumor diagnoses from free-text clinical documentation currently requires substantial manual effort. Promising approaches for automating this process include large language models (LLMs), embedding models, and retrieval-augmented generation (RAG). While previous studies often focus on a single method, we directly compare these approaches on a real-world dataset of tumor diagnosis descriptions to assess their strengths and limitations. Methods: We evaluated nine different embedding models using similarity search and embedding-based classification, as well as LLM-based coding, with and without RAG, on a real-world dataset of 2,024 unique German tumor diagnosis descriptions labeled with ICD-10 and ICD-O topography codes. The retrieval knowledge base was constructed exclusively from standardized Alpha-ID, ICD-10-GM, and ICD-O-3 classifications. Performance was assessed for exact (full-code) and partial (three-character) code prediction. For RAG, we evaluated base and fine-tuned versions of Llama 3.1 8B and Llama 3.3 70B. Results: Qwen3-Embedding-8B, the largest embedding model, yielded the best results. It achieved 47.8% exact-match and 72.1% partial-match accuracy for ICD-10 coding with classification, and 42.7% exact-match and 73.5% partial-match accuracy for ICD-O coding with similarity search. The other embedding models, including medically specialized ones, showed varied but lower performance. RAG improved base LLM performance and outperformed embedding-based approaches on partial-match accuracy (80.6% partial-match accuracy for ICD-10 and 75.0% for ICD-O with Llama 3.3 70B), but not on exact-match accuracy. Conclusion: A direct comparison with embedding-based approaches is essential to determine whether the additional effort of RAG is justified. The strong variation in performance also highlights the importance of model selection. 
Further advances in embedding-based methods, potentially supported by larger and more diverse training data, may offer a promising direction for future work.
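The distinction between exact (full-code) and partial (three-character) matching can be made concrete with a small sketch. The function name `match_accuracy`, the normalization step (trimming whitespace and upper-casing), and the example codes are assumptions for illustration only; the abstract does not describe the authors' evaluation code.

```python
def match_accuracy(gold_codes, predicted_codes, prefix_len=None):
    """Share of predictions that match the gold code.

    prefix_len=None compares full codes (exact match);
    prefix_len=3 compares only the first three characters (partial match).
    """
    if len(gold_codes) != len(predicted_codes):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_codes:
        raise ValueError("at least one code pair is required")

    def key(code):
        # Assumed normalization: trim whitespace and ignore case.
        code = code.strip().upper()
        return code if prefix_len is None else code[:prefix_len]

    hits = sum(key(g) == key(p) for g, p in zip(gold_codes, predicted_codes))
    return hits / len(gold_codes)


# Hypothetical example codes, not taken from the study's dataset.
gold = ["C50.9", "C34.1", "C61"]
pred = ["C50.4", "C34.1", "C18.7"]
exact = match_accuracy(gold, pred)                  # only C34.1 matches in full
partial = match_accuracy(gold, pred, prefix_len=3)  # C50 and C34 match on the prefix
```

A prediction such as C50.4 for a gold code of C50.9 counts as a partial match but not an exact one, which is why partial-match accuracy is always at least as high as exact-match accuracy.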